Week 2: Regression + Intro to Statistical Models

PS 818 - Statistical Models

Anton Strezhnev

University of Wisconsin-Madison

September 8, 2025

\[ \require{cancel} \DeclareMathOperator*{\argmin}{arg\,min} \]

\[ \DeclareMathOperator*{\argmax}{arg\,max} \]

Previously

  • Estimands, Estimators, Estimates
  • Properties of (frequentist) estimators
    • Bias
    • Variance
    • Consistency
  • How to articulate uncertainty
    • What does it mean to say that your 95% confidence interval is \([0.40, 0.60]\)?
    • How does it connect to hypothesis testing?

This week

  • Linear regression and the “Best Linear Predictor”
  • Classical statistical theory for OLS
    • When is it unbiased/consistent
    • How do we do inference?
    • Gauss-Markov assumptions
  • Introduction to likelihood inference

Regression

  • Rather than just estimating a population mean, we are more typically interested in some population conditional expectation \(\mathbb{E}[Y|X]\)
    • \(Y_i\): Outcome/response/dependent variable
    • \(X_i\): Vector of regressor/independent variables
  • “How does the expected value of \(Y\) differ across different values of \(X\)?”
  • Suppose we observe \(N\) paired observations of \(\{Y_i, X_i\}\).
    • How do we construct a “good” estimator of \(\mathbb{E}[Y|X]\)?
    • What assumptions do we have to make to get…consistency…unbiasedness…efficiency?

Example: Predicting the 2016 election

  • How do different parts of the U.S. vary in their election outcomes?
    • Knowing something about the region, can I predict what share voted for Clinton in the 2016 election?
election <- read_csv("data/prez_election_2016_vote.csv")
election$voteshare <- 100*(election$candidatevotes/election$totalvotes)
election <- election %>% filter(!is.na(voteshare)&!is.na(pctWhiteNH))
  • What is the expected vote share (\(Y\)) conditional on the share of county residents who are non-Hispanic white (\(X\))?

Plot the data!

Binning

  • We could coarsen \(X\) and just calculate means within each bin!
    • In fact…this is what we do with large datasets anyway…and this ends up also being a regression!
election$pctWhiteNH_bins <- cut(election$pctWhiteNH, 10)
election_means <- election %>% group_by(pctWhiteNH_bins) %>%
  summarize(voteshare = mean(voteshare, na.rm=T), pctWhiteNH = median(pctWhiteNH, na.rm=T))

Binning

Best Linear Predictor (BLP)

  • The Best Linear Predictor or population regression is a function of \(\mathbf{X}\)

\[m(x) = x^\prime\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dotsc\]

  • The \(\beta\) parameters minimize the expected prediction error

\[\beta = \argmin_b \ \mathbb{E}[(Y_i - X_i^\prime b)^2]\]

  • Note that this is a population quantity
    • \(m(x)\) is an estimand like any other estimand we’ve talked about
    • \(m(x)\) and \(E[Y|X]\) might differ substantially!

Best Linear Predictor (BLP)

  • Is there a closed form expression for the population regression coefficients \(\beta\)?
    • Yes - with some algebra (solving for the first order conditions), we get

\[\beta = (\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\]

  • Our linear projection becomes

\[m(X_i) = X_i^\prime\beta = X_i^\prime (\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\]
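  • As an illustration (not from the slides), we can approximate the population coefficients by evaluating the closed-form expression on a very large simulated sample where the true CEF is deliberately nonlinear:

```r
# Sketch: approximate the population BLP coefficients by plugging a very
# large simulated sample into beta = (E[X_i X_i'])^{-1} E[X_i Y_i].
set.seed(123)
n <- 1e6
x <- runif(n)
y <- exp(x) + rnorm(n)            # nonlinear CEF: E[Y|X] = exp(X)
X <- cbind(1, x)                  # regressors, including the intercept

beta_blp <- solve(crossprod(X) / n, crossprod(X, y) / n)
drop(beta_blp)                    # roughly (0.87, 1.69)
```

  • Even though \(\mathbb{E}[Y|X] = e^{X}\) is not linear, the BLP is still well-defined: it is the best linear approximation to that CEF.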

Properties

  • What have we assumed so far?
    • Finite mean/variance/covariances for \(Y\) and \(X\)
    • Columns of \(X\) are linearly independent (no perfect collinearity)

Projection error

  • What can we say about the difference between the BLP and \(Y\)?

\[e_i = Y_i - m(X_i)\]

  • Rewriting, we have

\[Y_i = X_i^\prime\beta + e_i\]

Projection error

  • We can show that the projection error is mechanically uncorrelated with the predictors

\[\begin{align*} \mathbb{E}[X_i e_i] &= \mathbb{E}[X_i(Y_i - X_i^\prime\beta)]\\ &= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iX_i^\prime]\beta\\ &= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iX_i^\prime](\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\\ &= \mathbb{E}[X_iY_i] - \mathbb{E}[X_iY_i] = 0\\ \end{align*}\]

  • Since \(X_{ik} = 1\) is typically one of the regressors (the “intercept”), this also implies \(\mathbb{E}[e_i] = 0\)

Projection error

  • This means that the covariance of \(X_i\) and the projection error \(e_i\) is zero!

\[\begin{align*} Cov(X_i, e_i) = \mathbb{E}[X_ie_i] - \mathbb{E}[X_i]\mathbb{E}[e_i] = 0 - 0 = 0 \end{align*}\]

  • Importantly, we have derived these properties of the Best Linear Predictor and the projection error without making any assumptions about the error itself!
    • This is mechanically true by the way we’ve defined the BLP

Estimating the BLP

  • Remember, \(m(X)\) is a population quantity - can we come up with an estimator in our sample that is consistent for it?

    • plug-in principle - where we have population expectations, plug in their sample equivalents
  • Our estimand

    \[\beta = (\mathbb{E}[X_iX_i^\prime])^{-1}\mathbb{E}[X_iY_i]\]

  • Our estimator

    \[\hat{\beta} = \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n X_iY_i\bigg)\]

  • Often written as:

    \[\hat{\beta} = (\mathbf{X}^\prime\mathbf{X})^{-1}(\mathbf{X}^\prime Y)\]

Estimating the BLP

  • Is \(\hat{\beta}\) consistent for \(\beta\)?
  • Let’s write \(\hat{\beta}\) as a function of \(\beta\) and the projection error

\[\hat{\beta} = \beta + \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n X_ie_i\bigg) \]
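  • This identity follows from substituting \(Y_i = X_i^\prime\beta + e_i\) into the estimator:

\[\begin{align*} \hat{\beta} &= \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n X_i(X_i^\prime\beta + e_i)\bigg)\\ &= \beta + \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n X_ie_i\bigg) \end{align*}\]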

  • We can use the weak law of large numbers

    • Sample averages of i.i.d. random variables converge in probability to their population expectations

    \[\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime \overset{p}{\to} \mathbb{E}[X_iX_i^\prime]\]

    \[\frac{1}{n} \sum_{i=1}^n X_ie_i \overset{p}{\to} \mathbb{E}[X_ie_i] = 0\]
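  • A quick simulation sketch (with made-up data) of the WLLN at work: the sample moment \(\frac{1}{n}\sum_i X_ie_i\) collapses toward zero as \(n\) grows.

```r
# Sketch: (1/n) * sum(x_i * e_i) converges to its expectation of zero
# when x and the projection error are uncorrelated.
set.seed(60637)
moment_at_n <- function(n) {
  x <- rnorm(n)
  e <- rnorm(n)              # stands in for the projection error
  mean(x * e)
}
sapply(c(1e2, 1e4, 1e6), moment_at_n)   # magnitudes shrink toward 0
```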

Estimating the BLP

  • Plugging it all in, we have

    \[\hat{\beta} \overset{p}{\to} \beta\]

  • Ordinary least squares is consistent for the best linear predictor under relatively mild assumptions

    • i.i.d. sample (can weaken this with LLNs for dependent random variables)
    • finite moments (true for most outcomes and regressors you’ll work with)
    • no perfect collinearity (you can check if this is a problem!)
  • What have we not assumed?

    • Linear CEF
    • Homoskedastic errors
    • Normal outcome

Asymptotic normality

  • In large samples, what is the distribution of \(\hat{\beta}\)?

\[\sqrt{n}(\hat{\beta} - \beta) = \bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1}\bigg(\frac{1}{\sqrt{n}} \sum_{i=1}^n X_ie_i\bigg)\]

  • We know that \(\bigg(\frac{1}{n} \sum_{i=1}^n X_iX_i^\prime\bigg)^{-1} \overset{p}{\to} E[X_iX_i^\prime]^{-1}\)

Asymptotic normality

  • We can apply the CLT to the other term. We know the expectation is zero, what about the variance?

    \[Var(X_ie_i) = \mathbb{E}[X_i e_i (X_i e_i)^\prime] = \mathbb{E}[e_i^2X_i X_i^\prime]\]

  • By the CLT, we have

    \[\frac{1}{\sqrt{n}} \sum_{i=1}^n X_ie_i \overset{d}{\to} \mathcal{N}(0, \mathbb{E}[e_i^2X_i X_i^\prime])\]

  • Combining with our other convergence result using Slutsky’s theorem gives

    \[\sqrt{n}(\hat{\beta} - \beta) \overset{d}{\to} \mathcal{N}\bigg(0, (E[X_iX_i^\prime])^{-1} (\mathbb{E}[e_i^2X_i X_i^\prime]) (E[X_iX_i^\prime])^{-1}\bigg)\]
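  • A simulation sketch (simulated data, not the election example): the spread of \(\sqrt{n}(\hat{\beta}_1 - \beta_1)\) across repeated samples matches the sandwich formula, even with heteroskedastic errors.

```r
# Sketch: compare the simulated sd of sqrt(n)*(beta_hat - beta) for the
# slope against the analytic sandwich variance.
set.seed(60637)
n <- 500
one_draw <- function() {
  x <- runif(n)
  e <- rnorm(n, sd = x)                      # heteroskedastic errors
  y <- 1 + 2 * x + e
  unname(coef(lm(y ~ x))[2])
}
slopes <- replicate(2000, one_draw())

# analytic pieces for x ~ Uniform(0,1) and E[e^2 | x] = x^2
A <- matrix(c(1, 1/2, 1/2, 1/3), 2, 2)       # E[X_i X_i']
B <- matrix(c(1/3, 1/4, 1/4, 1/5), 2, 2)     # E[e_i^2 X_i X_i']
V <- solve(A) %*% B %*% solve(A)
c(simulated = sqrt(n) * sd(slopes), analytic = sqrt(V[2, 2]))
```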

Variance estimation

  • Again, we have a bunch of population expectations, let’s plug in their sample equivalents!

  • Our estimand

    \[Var(\hat{\beta}) = \frac{1}{n} (E[X_iX_i^\prime])^{-1} (\mathbb{E}[e_i^2X_i X_i^\prime]) (E[X_iX_i^\prime])^{-1}\]

  • Our estimator

    \[\widehat{Var(\hat{\beta})} = \frac{1}{n} \bigg(\frac{1}{n} \sum_{i=1}^n X_i X_i^\prime \bigg)^{-1}\bigg(\frac{1}{n} \sum_{i=1}^n \hat{e}_i^2 X_i X_i^\prime \bigg)\bigg(\frac{1}{n} \sum_{i=1}^n X_i X_i^\prime \bigg)^{-1}\]

Variance estimation

  • This is often known as the “sandwich” estimator

\[\widehat{Var(\hat{\beta})} = (\mathbf{X}^\prime\mathbf{X})^{-1} (\mathbf{X}^\prime \hat{\Sigma} \mathbf{X}) (\mathbf{X}^\prime\mathbf{X})^{-1}\] where \(\hat{\Sigma} = \text{diag}(\hat{e}_1^2, \hat{e}_2^2, \dotsc, \hat{e}_n^2)\)

  • It’s also referred to as the heteroskedasticity-consistent robust variance estimator
    • We have made no assumptions on the variance of \(e_i\) – we’re just plugging in the regression residuals.

BLP vs. Binning

Inference

  • Let’s actually calculate the “robust” standard errors by hand
linreg <- lm(voteshare ~ pctWhiteNH, data=election)
summary(linreg)

Call:
lm(formula = voteshare ~ pctWhiteNH, data = election)

Residuals:
   Min     1Q Median     3Q    Max 
-35.82  -8.31  -0.85   7.20  45.08 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  67.8083     0.9059    74.9   <2e-16 ***
pctWhiteNH   -0.4617     0.0112   -41.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.2 on 3112 degrees of freedom
Multiple R-squared:  0.354, Adjusted R-squared:  0.353 
F-statistic: 1.7e+03 on 1 and 3112 DF,  p-value: <2e-16

Inference

  • Get the \(\mathbf{X}\) matrix
X <- model.matrix(voteshare ~ pctWhiteNH, data=election)
head(X)
  (Intercept) pctWhiteNH
1           1       77.2
2           1       83.5
3           1       46.8
4           1       75.0
5           1       88.9
6           1       21.9

Inference

  • Get the residuals
e <- residuals(linreg)
  • “sandwich” estimator
bread <- solve(t(X)%*%X)
meat <- (t(X)%*%diag(e^2)%*%X)
hc0 <- bread%*%meat%*%bread
hc0
            (Intercept) pctWhiteNH
(Intercept)      1.0707  -0.012342
pctWhiteNH      -0.0123   0.000149

Inference

  • Let’s get the standard errors
sqrt(diag(hc0))
(Intercept)  pctWhiteNH 
     1.0348      0.0122 
  • Compare to the classical ones
sqrt(diag(vcov(linreg)))
(Intercept)  pctWhiteNH 
     0.9059      0.0112 
  • You can get the robust variance estimator using lm_robust in estimatr
lm_robust(voteshare ~ pctWhiteNH, data=election, se_type="HC0")
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept)   67.808     1.0348    65.5  0.00e+00   65.779   69.837 3112
pctWhiteNH    -0.462     0.0122   -37.9 1.31e-258   -0.486   -0.438 3112

Classical regression

  • We’ve shown that the OLS estimator is consistent and asymptotically normal for the Best Linear Predictor

    • What about estimating the conditional expectation function \(\mathbb{E}[Y|X]\)?
  • To do inference on the CEF, we assume that it’s equal to the BLP!

  • Assumption 1: Linearity

    \[Y_i = X_i^\prime\beta + \epsilon_i\]

  • Assumption 2: Strict exogeneity of the errors

    \[\mathbb{E}[\epsilon | \mathbf{X}] = 0\]

  • What does this get us?

    • Finite-sample properties of OLS (unbiasedness in addition to consistency!)
  • When is linearity always true?

    • When we have a fully-saturated model!

Classical regression

  • Assumption 3: No perfect collinearity
    • \(\mathbf{X}^{\prime}\mathbf{X}\) is invertible
    • \(\mathbf{X}\) has full column rank
  • Again, when does this fail? How can we check?

Classical regression

  • Under assumptions 1-3, our OLS estimator \(\hat{\beta}\) is also unbiased for \(\beta\)
  • Let’s do a quick proof for unbiasedness

\[\begin{align*}\hat{\beta} &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}Y)\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}(\mathbf{X}\beta + \epsilon))\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\mathbf{X})\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \end{align*}\]

  • Then we can obtain the conditional expectation of \(\mathbb{E}[\hat{\beta} | \mathbf{X}]\)

\[\begin{align*} \mathbb{E}[\hat{\beta} | \mathbf{X}] &= \mathbb{E}\bigg[\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \bigg| \mathbf{X} \bigg]\\ &= \mathbb{E}[\beta | \mathbf{X}] + \mathbb{E}[(\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime} \mathbb{E}[\epsilon | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}0\\ &= \beta \end{align*}\]

Classical regression

  • Lastly, by law of total expectation

    \[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\mathbb{E}[\hat{\beta}|\mathbf{X}]]\]

  • Therefore

    \[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\beta] = \beta\]

  • Note that the assumptions here are slightly stronger than what we needed for consistency for the BLP

  • But what have we not assumed?

    • Anything about the distribution of the errors (aside from finite 1st/2nd moments)

Classical regression

  • Assumption 4 - Spherical errors

    \[Var(\epsilon | \mathbf{X}) = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0\\ 0 & \sigma^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 \mathbf{I}\]

  • Benefits

    • Simple, unbiased estimator for the variance of \(\hat{\beta}\)
    • Completes Gauss-Markov assumptions \(\leadsto\) OLS is BLUE (Best Linear Unbiased Estimator)
  • Drawbacks

    • Basically never is true

Classical regression

  • Under spherical errors, the variance formula simplifies

    \[Var(\hat{\beta}) = (\mathbf{X}^\prime\mathbf{X})^{-1} (\mathbf{X}^\prime \sigma^2\mathbf{I} \mathbf{X}) (\mathbf{X}^\prime\mathbf{X})^{-1}\] \[Var(\hat{\beta}) = \sigma^2 (\mathbf{X}^\prime\mathbf{X})^{-1}\]

  • All we need is an estimator for \(\sigma^2\)

    • A consistent one is just the sample variance of the residuals
    • Can do better by including a “degrees of freedom” correction – the within-sample variance is always going to under-estimate the population variance

Classical regression

  • An unbiased estimator for \(\sigma^2\) is the sample variance of the residuals (mean squared error) adjusted by a degrees of freedom correction

    \[\hat{\sigma}^2 = \frac{1}{n-k} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2\]

  • \(k\) is the number of parameters. We refer to \(n-k\) as the “degrees of freedom”
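  • A quick simulation sketch of why the correction matters: across repeated samples, \(\text{SSR}/(n-k)\) is centered on the true \(\sigma^2\), while \(\text{SSR}/n\) sits below it.

```r
# Sketch: the df-adjusted residual variance is unbiased for sigma^2;
# the unadjusted version underestimates it by a factor of (n-k)/n.
set.seed(60637)
n <- 20; k <- 2; sigma2 <- 4
one_fit <- function() {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n, sd = sqrt(sigma2))
  ssr <- sum(residuals(lm(y ~ x))^2)
  c(unadjusted = ssr / n, adjusted = ssr / (n - k))
}
rowMeans(replicate(5000, one_fit()))   # roughly c(3.6, 4.0)
```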

OLS and Projections

  • One way to think about the fitted values from OLS is that they are a projection of the n-dimensional vector \(Y\) into the column space of \(\mathbf{X}\)

    • A column space is the set of all linear combinations of the columns of \(\mathbf{X}\)
  • We define the projection matrix or hat matrix

    \[\mathbf{P}_{\mathbf{X}} = \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X^\prime}\]

  • Our fitted values \(\hat{Y}\) are a projection of \(Y\) to the “closest” vector that can be represented as a linear combination of the columns of \(\mathbf{X}\)

    \[\hat{Y} = \mathbf{P}_{\mathbf{X}}Y = \mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{X^\prime}Y = \mathbf{X}\hat{\beta}\]
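  • The projection interpretation is easy to verify numerically (a sketch with simulated data):

```r
# Sketch: the hat matrix P_X is symmetric, idempotent, leaves the
# column space of X fixed, and reproduces the OLS fitted values.
set.seed(60637)
X <- cbind(1, rnorm(50), rnorm(50))
y <- rnorm(50)
P <- X %*% solve(t(X) %*% X) %*% t(X)

all.equal(P %*% P, P)        # idempotent: projecting twice changes nothing
all.equal(P, t(P))           # symmetric
all.equal(P %*% X, X)        # X is already in the column space
all.equal(drop(P %*% y), unname(fitted(lm(y ~ X[, -1]))))   # P_X y = fitted values
```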

OLS is BLUE

  • The Gauss-Markov Theorem
    • Under Assumptions 1-4, we can show that no other linear, unbiased estimator of \(\beta\) has lower variance than \(\hat{\beta}_{\text{OLS}}\)

OLS is BLUE

  • Proof: Suppose there’s another linear estimator that diverges from the OLS projection of \(Y\) by a non-zero \(k \times n\) matrix \(D\)

    \[\tilde{\beta} = ((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D)Y\]

  • Let’s take the expectation

    \[\mathbb{E}[\tilde{\beta}] = \mathbb{E}[\hat{\beta}_{\text{OLS}}] + \mathbb{E}[DY]\]

  • We know OLS is unbiased. Next, we plug in the expression for \(Y\) (linearity)

    \[\mathbb{E}[\tilde{\beta}] = \beta + \mathbb{E}[D\mathbf{X}\beta] + \mathbb{E}[D\epsilon]\]

OLS is BLUE

  • Pull out the constants and use \(\mathbb{E}[\epsilon] = 0\)

    \[\mathbb{E}[\tilde{\beta}] = (\mathbf{I} + D\mathbf{X})\beta\]

  • If \(\tilde{\beta}\) is unbiased, then

    \[(\mathbf{I} + D\mathbf{X})\beta = \beta \quad \Rightarrow \quad D\mathbf{X}\beta = 0\]

  • This holds for all values of \(\beta\) only if \(D\mathbf{X} = 0\)

OLS is BLUE

  • Under homoskedasticity, the variance is

    \[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)^{\prime}\]

  • Using properties of matrix transposes

    \[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime + D\bigg)\bigg(\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + D^\prime\bigg)\]

  • Expanding

    \[Var(\tilde{\beta}) = \sigma^2\bigg((\mathbf{X}^\prime\mathbf{X})^{-1} \mathbf{X}^\prime\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + (\mathbf{X}^\prime\mathbf{X})^{-1}(D\mathbf{X})^\prime + D\mathbf{X}(\mathbf{X}^\prime\mathbf{X})^{-1} + DD^\prime\bigg)\]

OLS is BLUE

  • Using \(D\mathbf{X} = 0\) from above along with the definition of the variance of \(\hat{\beta}_{\text{OLS}}\)

    \[Var(\tilde{\beta}) = Var(\hat{\beta}_{\text{OLS}}) + \sigma^2DD^\prime\]

  • The variance of our alternative linear, unbiased estimator exceeds the variance of OLS by the positive semi-definite matrix \(\sigma^2DD^\prime\)

  • Therefore OLS is the lowest-variance linear unbiased estimator

Classical regression

  • Assumption 5 - Normality of the errors

\[\epsilon | \mathbf{X} \sim \mathcal{N}(0, \sigma^2\mathbf{I})\]

  • Not necessary even for Gauss-Markov assumptions
  • Not needed to do asymptotic inference on \(\hat{\beta}\)
    • Why? Central Limit Theorem!
  • Benefits?
    • Finite-sample inference (no reliance on asymptotics for hypothesis tests)
    • Inference for predictions from the model.
  • Drawbacks
    • Again, almost never true unless our outcome is a sum or mean of a large number of i.i.d. random variables

Regression review

  • What do we need for OLS to be consistent for the “best linear approximation” to the CEF?
    • Very little!
  • What do we need for OLS to be consistent and unbiased for the conditional expectation function?
    • Truly linear CEF
    • But still no assumptions about the outcome distribution!
  • What do we need to do inference on \(\hat{\beta}\)?
    • We almost never assume homoskedasticity because “robust” SE estimators are ubiquitous
    • Even some forms of error correlation are permitted (“cluster” robust SEs)
    • Sample sizes are usually large enough where Central Limit Theorem implies a normal sampling distribution is a reasonable approximation.

OLS by hand

  • Our point estimate
X <- cbind(1, election$pctWhiteNH)
beta <- solve(t(X)%*%X)%*%t(X)%*%election$voteshare
  • Our variance estimator
err <- election$voteshare - X%*%beta
sigma_2_hat <- (1/(nrow(X)-ncol(X)))*sum(err^2)
vcov_beta <- sigma_2_hat*solve(t(X)%*%X)
  • Results
beta
       [,1]
[1,] 67.808
[2,] -0.462
sqrt(diag(vcov_beta))
[1] 0.9059 0.0112
tidy(linreg)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   67.8      0.906       74.9 0        
2 pctWhiteNH    -0.462    0.0112     -41.3 2.98e-297

Partial regression

  • It can be difficult to understand the mechanics of OLS with many regressors.

    • With a single regressor, we had a nice expression for the coefficient on \(X\)

    \[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

    \[\hat{\beta}_1 = \frac{\widehat{Cov}(Y_i, X_i)}{\widehat{Var}(X_i)}\]

  • Can we get something like this in the multivariate case?

    \[Y = \mathbf{X}\beta + \epsilon\]
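  • The bivariate identity is easy to check numerically (a sketch with simulated data):

```r
# Sketch: with one regressor, the OLS slope equals the ratio of the
# sample covariance to the sample variance of x.
set.seed(60637)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
slope_lm <- unname(coef(lm(y ~ x))["x"])
slope_ratio <- cov(y, x) / var(x)
all.equal(slope_lm, slope_ratio)   # TRUE
```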

Partial regression

  • With multiple predictors, we can write the regression coefficient on any of our regressors in terms of partial regressions.

    • Partition \(\mathbf{X}\) into two sets of covariates \(\mathbb{X}_1\) and \(\mathbb{X}_2\)

    \[Y = \mathbb{X}_1\beta_1 + \mathbb{X}_2\beta_2 + \epsilon\]

  • The Frisch-Waugh-Lovell theorem states that the OLS estimate \(\hat{\beta}_1\) can be recovered by:

  1. Obtaining \(Y^{\perp \mathbb{X}_2}\), the residuals from a regression of \(Y\) on \(\mathbb{X}_2\)
  2. Obtaining \(\mathbb{X}_1^{\perp \mathbb{X}_2}\), the residuals from a regression of \(\mathbb{X}_1\) on \(\mathbb{X}_2\)
  3. Regressing \(Y^{\perp \mathbb{X}_2}\) on \(\mathbb{X}_1^{\perp \mathbb{X}_2}\)
  • Core theoretical mechanics of many recent methods papers!
    • Cinelli and Hazlett (2020) sensitivity analysis
    • Much of the “new DiD” literature (esp. Goodman-Bacon decomposition, Wooldridge’s paper on “two-way Mundlak”)

Partial regression

  • Suppose I wanted to “control” for state fixed effects
linreg_state <- lm(voteshare ~ pctWhiteNH + as.factor(state), data=election)
tidy(linreg_state) %>% filter(term == "pctWhiteNH")
# A tibble: 1 × 5
  term       estimate std.error statistic p.value
  <chr>         <dbl>     <dbl>     <dbl>   <dbl>
1 pctWhiteNH   -0.650    0.0106     -61.5       0

Partial regression

  • Regression of \(Y\) on the fixed effects
election$partial_y <- residuals(lm(voteshare ~ as.factor(state), data=election))
  • Regression of \(X\) on the fixed effects
election$partial_x <- residuals(lm(pctWhiteNH ~ as.factor(state), data=election))
  • FWL
tidy(lm(partial_y ~ partial_x, data=election)) %>% filter(term == "partial_x")
# A tibble: 1 × 5
  term      estimate std.error statistic p.value
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>
1 partial_x   -0.650    0.0105     -62.0       0

Partial regression

  • With fixed effects, partialing is equivalent to de-meaning \(Y\) and \(X\) within unit
    • Sometimes called the “Mundlak” regression
voteshare_means <- election %>% group_by(state) %>% summarize(voteshare_state = mean(voteshare))
pctWhiteNH_means <- election %>% group_by(state) %>% summarize(pctWhiteNH_state = mean(pctWhiteNH))

election <- election %>% left_join(voteshare_means, by="state")
election <- election %>% left_join(pctWhiteNH_means, by="state")

election$demeaned_y <- election$voteshare - election$voteshare_state
election$demeaned_x <- election$pctWhiteNH - election$pctWhiteNH_state

tidy(lm(demeaned_y ~ demeaned_x, data=election)) %>% filter(term == "demeaned_x")
# A tibble: 1 × 5
  term       estimate std.error statistic p.value
  <chr>         <dbl>     <dbl>     <dbl>   <dbl>
1 demeaned_x   -0.650    0.0105     -62.0       0

Statistical models

Defining a statistical model

  • In the regression setting we tried to make as few assumptions about the data-generating process as possible.
    • Our goal is just to estimate and conduct inference on \(E[Y|X]\).
  • But what if we wanted to make further probabilistic statements about other quantities beyond \(\beta\)
    • (e.g.) Can we provide a distribution for \(Y_{n+1}\), the “next” observation given \(X_{n+1}\)?
    • If we’re willing to make more assumptions about the data-generating process, we can do a lot more!
  • Statistical models specify the data-generating process in terms of systematic and stochastic components.
    • Systematic elements are functions of known constants and unknown parameters
    • Stochastic elements are draws from probability distributions
  • We will be primarily working with parametric models
    • The data will be assumed to come from a particular family of probability distributions
    • The “structure” of the model is assumed fixed (the number of parameters does not grow with the size of the data).

The linear model

  • It is common to see the linear model written in its fully parametric form.

  • Stochastic:

    \[Y_i \sim \text{Normal}(\mu_i, \sigma^2)\]

  • Systematic:

    \[\mu_i = X_i^{\prime}\beta\]

  • What’s assumed to be known?

    • \(\mathbf{X}\)
  • What’s assumed to have a particular distribution?

    • \(Y\)
  • We are interested in estimating and conducting inference on the parameters: \(\beta\) and (less importantly) \(\sigma^2\).
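  • One payoff of writing the model this way is that it tells us exactly how to simulate data from it (a sketch with made-up parameter values):

```r
# Sketch: simulate from the parametric linear model.
# Systematic: mu_i = X_i' beta; stochastic: Y_i ~ Normal(mu_i, sigma^2).
set.seed(60637)
n <- 1000
X <- cbind(1, runif(n))               # known constants
beta <- c(0.5, -1.5)                  # "unknown" parameters, set for the sketch
sigma <- 2
mu <- drop(X %*% beta)                # systematic component
y <- rnorm(n, mean = mu, sd = sigma)  # stochastic component
coef(lm(y ~ X[, 2]))                  # recovers beta approximately
```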

General model notation

  • We can specify a broad set of models for \(Y_i\) using this framework

  • Stochastic

    \[Y_i \sim f(\theta_i, \alpha)\]

  • Systematic

    \[\theta_i = g(X_i, \beta)\]

  • What are these quantities?

    • \(Y_i\) is a random variable
    • \(f()\) denotes the distribution of that random variable
    • \(\theta_i\) and \(\alpha\) are parameters of that distribution
    • \(g()\) is some function
    • \(X_i\) are observed, known constants (e.g. regressors)
    • \(\beta\) are parameters of interest
  • We will spend some time with a particular class of models called “Generalized Linear Models” where the systematic component has the form

    \[\theta_i = g(X_i^{\prime}\beta)\]

Types of distributions

  • Normal

  • Continuous on an unbounded support \((-\infty, \infty)\)

  • Two parameters: Mean \(\mu\) and Variance \(\sigma^2\)

  • Probability Density Function (PDF)

\[f_N(x;\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{- \frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2 \right\}\]

Types of distributions

  • Poisson

  • Discrete, defined on the support of the non-negative integers \(\{0, 1, 2, \dotsc\}\)

  • Single parameter: Mean and Variance \(\lambda\)

  • Probability Mass Function (PMF)

\[f(x; \lambda) = \frac{\lambda^x\exp\{-\lambda\}}{x!}\]

Types of distributions

  • Binomial

  • Discrete, defined on the support of integers \(\{0, 1, 2, \dotsc, n\}\) (models the sum of \(n\) i.i.d. coin flips)
  • Two parameters: success probability \(p\) and number of trials \(n\)
    • Special case where \(n=1\) is typically called the “Bernoulli”
  • Probability Mass Function (PMF)

\[f(x; p, n) = {n \choose x} p^x(1-p)^{n-x}\]
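  • Each of these formulas matches R’s built-in density/mass functions, which is a quick way to check you’ve written them down correctly (a sketch with arbitrary parameter values):

```r
# Sketch: hand-coded density/mass formulas vs. R's built-ins.
x <- 3
# Normal pdf with mu = 1, sigma = 2
f_norm <- (1 / (2 * sqrt(2 * pi))) * exp(-0.5 * ((x - 1) / 2)^2)
all.equal(f_norm, dnorm(x, mean = 1, sd = 2))            # TRUE
# Poisson pmf with lambda = 4
f_pois <- (4^x * exp(-4)) / factorial(x)
all.equal(f_pois, dpois(x, lambda = 4))                  # TRUE
# Binomial pmf with n = 10, p = 0.3
f_binom <- choose(10, x) * 0.3^x * (1 - 0.3)^(10 - x)
all.equal(f_binom, dbinom(x, size = 10, prob = 0.3))     # TRUE
```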

Likelihood inference

Learning about the unknown

  • Suppose that I want to learn about an unobserved parameter from a sample of observations.
    • I assume a particular statistical model for the data
    • Example: The linear model from earlier!
  • Frequentist approach
    • Unobserved parameters are fixed constants
    • Data are random variables
  • We want to construct an estimator that is a function of the data and which has desirable properties
    • Consistency: As our sample size gets larger, the estimator converges (in probability) to the target parameter.
    • Asymptotic normality: In large samples, the distribution of our estimator is normal (ideally with a variance we can estimate as well!)
  • Can we come up with a generic framework that yields a “good” estimator for a large class of statistical models?

The Likelihood Function

  • Consider data \(\mathbf{y} = \{y_1, y_2, \dotsc, y_n\}\) which we assume comes from a known distribution with density \(f(\mathbf{y} | \theta)\) with unknown parameter \(\theta\)

    • \(f()\) could be Bernoulli, Normal, Poisson, Gamma, Beta, etc… – in this setting it is a known distribution
    • \(\theta\) is an unknown parameter that lies in some space \(\Theta\) of possible values.
  • The likelihood function is a function of \(\theta\) that is evaluated at the observed values of \(\mathbf{y}\)

    \[\mathcal{L}(\theta| \mathbf{y}) = f(\mathbf{y} | \theta)\]

  • Given some input \(\theta\), the likelihood function returns the density of the observed data evaluated at that value.

  • The likelihood function is not a probability density

    • a pdf is a function that takes \(x\) as an input
    • a likelihood is a function that takes \(\theta\) as an input

The Likelihood Function

  • The likelihood function is not \(f(\theta | \mathbf{y})\)

    • That statement doesn’t even make sense in a frequentist framework - parameters are constant
  • Even when we (later) move to a Bayesian framework \(f(\theta | \mathbf{y}) \neq f(\mathbf{y} | \theta)\)

  • Remember Bayes’ Rule:

    \[f(\theta | \mathbf{y}) = \frac{f(\mathbf{y} | \theta)}{f(\mathbf{y})} \times f(\theta)\]

  • Or alternatively, since \(f(\mathbf{y})\) is just a normalizing constant, you’ll see this written as:

    \[\underbrace{f(\theta | \mathbf{y})}_{\text{posterior}} \propto \underbrace{f(\mathbf{y}|\theta)}_{\text{likelihood}} \times \underbrace{f(\theta)}_{\text{prior}}\]

The Likelihood Function

  • We often make the assumption that our data are independently and identically distributed

    • \(y_i \overset{\text{i.i.d.}}{\sim} f(y_i | \theta)\)
  • This allows us to factor the likelihood

    \[\mathcal{L}(\theta|\mathbf{y}) = f(\mathbf{y} | \theta) = \prod_{i=1}^n f(y_i | \theta)\]

  • Since our eventual goal will be to find an optimum of the likelihood, we can apply any monotonic function to it since that preserves maxima/minima.

    • The main one we’ll apply is the logarithm
    • Why? Because logs turn annoying-to-work-with products into easier-to-work-with sums!
  • The log-likelihood is typically denoted \(\ell(\theta| \mathbf{y})\)

    \[\ell(\theta|\mathbf{y}) = \log f(\mathbf{y}|\theta) = \sum_{i=1}^n \log f(y_i | \theta)\]
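  • A small numerical sketch of why the log helps (Bernoulli data, arbitrary \(\theta\)):

```r
# Sketch: the log of the product likelihood equals the sum of the
# individual log densities for i.i.d. Bernoulli data.
y <- c(1, 0, 1, 1, 0, 1)
theta <- 0.6
lik <- prod(theta^y * (1 - theta)^(1 - y))                  # L(theta | y)
loglik <- sum(y * log(theta) + (1 - y) * log(1 - theta))    # l(theta | y)
all.equal(log(lik), loglik)   # TRUE
```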

MLE

  • We want to come up with an estimator \(\hat{\theta}\) for the parameter \(\theta\)
    • \(\hat{\theta}\) is a function of the data (like a sample mean or OLS coefficient)
    • Is there a principled way to pick \(\hat{\theta}\) that has provably “good” properties across a wide variety of models?
  • The Maximum Likelihood Estimator (MLE) \(\hat{\theta}\) is defined as the value of \(\theta\) that yields the optimum value of the likelihood function

\[\hat{\theta} = \argmax_\theta \ \log f(\mathbf{y}|\theta)\]

  • Under i.i.d. observations, the MLE has some desirable properties:
    • The MLE \(\hat{\theta}\) is consistent for the true parameter \(\theta\)
    • It’s asymptotically normal: \(\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\to} \text{Normal}(0, \frac{1}{\mathcal{I}(\theta)})\)
    • It’s asymptotically efficient: its variance approaches the Cramer-Rao lower bound \(\frac{1}{n\mathcal{I}(\theta)}\)
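  • A sketch of the MLE in action: for i.i.d. Bernoulli data the closed-form MLE is the sample mean, and a numerical optimizer lands in the same place.

```r
# Sketch: maximize the Bernoulli log-likelihood numerically and compare
# with the closed-form MLE (the sample mean).
set.seed(60637)
y <- rbinom(200, size = 1, prob = 0.3)
loglik <- function(theta) sum(y * log(theta) + (1 - y) * log(1 - theta))
mle <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
c(numerical = mle, closed_form = mean(y))
```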

Score

  • Before illustrating consistency, we need to define some additional functions of the log-likelihood.

  • The first is the score or the gradient of \(\ell(\theta|\mathbf{y})\) with respect to \(\theta\)

    \[\mathbf{S} = \frac{\partial}{\partial\theta} \log f(\mathbf{y}|\theta)\]

  • With i.i.d. data, you’ll often see the score written in terms of the score for an individual observation \(i\)

    \[\mathbf{S}_i = \frac{\partial}{\partial\theta} \log f(y_i|\theta)\]

  • At the value of the true parameter \(\theta_0\), the expected value of the score is \(0\) (under some regularity conditions)

    \[\mathbb{E}\bigg[\frac{\partial}{\partial\theta} \log f(y_i|\theta_0)\bigg] = \frac{\partial}{\partial\theta} \mathbb{E}[\log f(y_i|\theta_0)] = 0\]
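  • A simulation sketch of this property (Poisson data, true \(\lambda_0 = 5\)): the average score, evaluated at \(\lambda_0\), is close to zero.

```r
# Sketch: for the Poisson, d/dlambda log f(y | lambda) = y/lambda - 1.
# Its sample average at the true lambda is near its expectation of zero.
set.seed(60637)
lambda0 <- 5
y <- rpois(1e5, lambda0)
score <- y / lambda0 - 1
mean(score)   # close to 0
```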

Consistency

  • With this definition of the score, we can show consistency of the MLE under i.i.d. observations.

  • By the weak law of large numbers

    \[\frac{1}{n} \sum_{i=1}^n \log f(y_i | \theta) \overset{p}{\to} \mathbb{E}[\log f(y_i | \theta)]\]

  • The MLE is an optimizer of the left-hand side.

    • Therefore it converges in probability to the optimizer of the right-hand side
  • And we just showed that the expected score is \(0\) at the true value of \(\theta\), denoted \(\theta_0\)

    • Therefore \(\theta_0\) is an extremum of the right-hand side (under a few additional regularity conditions)
  • Therefore the MLE \(\hat{\theta}\) is consistent for the true value \(\theta_0\)

Consistency

  • There are two important conditions for consistency that can be violated in practice

  • Identifiability

    • No “plateaus” in the log-likelihood.
    • Two different values of \(\theta\) can’t both maximize the log-likelihood

    \[f(\mathbf{y}|\theta) \neq f(\mathbf{y} | \theta_0) \ \forall \ \theta \neq \theta_0\]

  • Fixed parameter space

    • The dimensionality of \(\theta\) stays fixed and does not depend on \(n\)
    • Violated (e.g.) in ideal point models (new legislators mean new ideal points)

Next week!

  • More on likelihood inference theory!
    • Asymptotic Normality
    • Sampling variance of the MLE
    • Cramer-Rao Lower Bound
  • Generalized Linear Models
    • Extending the tools of “linear regression” to other outcome distributions.